Biocuration In PubMed Until 2020

Biocuration is the field of life sciences dedicated to organizing biomedical data, information and knowledge into structured formats, such as spreadsheets, tables and knowledge graphs. The biocuration of biomedical knowledge is made possible by the cooperative work of biocurators, software developers and bioinformaticians and is at the base of the work of biological databases.


Biocuration as a profession

A biocurator is a professional scientist who curates, collects, annotates, and validates information that is disseminated by biological and model organism databases. It is a new profession, with the first mentions in the scientific literature dating from 2006, in the context of work on databases like the Immune Epitope Database and Analysis Resource. Biocurators are usually PhD-level scientists with a mix of experience in wet lab work and in computational representations of knowledge (e.g. via ontologies).

The role of a biocurator encompasses quality control of primary biological research data intended for publication, extracting and organizing data from original scientific literature, and describing the data with standard annotation protocols and vocabularies that enable powerful queries and biological database interoperability. Biocurators communicate with researchers to ensure the accuracy of curated information and to foster data exchanges with research laboratories.

Biocurators are present in diverse research environments, but may not self-identify as biocurators. Projects such as ELIXIR (the European life-sciences Infrastructure for biological Information) and GOBLET (Global Organization for Bioinformatics Learning, Education and Training) promote training and support biocuration as a career path.

In 2011, biocuration was already recognized as a profession, but there were no formal degree courses to prepare curators for biological data in a targeted fashion. With the growth of the field, the University of Cambridge and EMBL-EBI started to jointly offer a Postgraduate Certificate in Biocuration, considered a step towards recognising biocuration as a discipline in its own right. There is a perceived increase in demand for biocuration, and a need for additional biocuration training in graduate programs. Organizations that employ biocurators, like the Clinical Genome Resource (ClinGen), often provide specialized materials and training for biocuration.


Biological knowledgebases

The role of biocurators is best known in the field of biological knowledgebases. Such databases, like UniProt and PDB, rely on professional biocurators to organize information. Among other things, biocurators work to improve data quality, for example by merging duplicated entries. An important subset of these knowledgebases are the model organism databases, which rely on biocurators to curate information about particular kinds of organisms. Some notable examples of model organism databases are FlyBase, PomBase, and ZFIN, dedicated to curating information about ''Drosophila'', ''Schizosaccharomyces pombe'' and zebrafish, respectively.


Curation and annotation

Biocuration is the integration of biological information into on-line databases in a semantically standardized way, using appropriate unique traceable identifiers, and providing necessary metadata including source and provenance.
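
As a rough illustration of what "unique traceable identifiers" plus "source and provenance" can look like in practice, the sketch below models a single curated statement. The schema (`CuratedStatement`, its field names, and the `is_traceable` check) is hypothetical and not taken from any particular database; the source and curator values are placeholders.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CuratedStatement:
    """One curated assertion with identifiers and provenance (hypothetical schema)."""
    subject_id: str    # prefixed identifier of the entity being described
    predicate: str     # relation asserted by the curator
    object_id: str     # prefixed identifier from a controlled vocabulary
    source: str        # where the evidence came from, e.g. a literature reference
    curator: str       # who made the assertion
    curated_on: date   # when the assertion was made

    def is_traceable(self) -> bool:
        """True if both ends use prefixed identifiers and a source is recorded."""
        has_prefixed_ids = all(":" in x for x in (self.subject_id, self.object_id))
        return has_prefixed_ids and bool(self.source)


stmt = CuratedStatement(
    subject_id="UniProtKB:P04637",  # human p53 protein
    predicate="involved_in",
    object_id="GO:0006915",         # apoptotic process
    source="PMID:00000000",         # placeholder, not a real citation
    curator="curator:example",      # placeholder curator handle
    curated_on=date(2021, 1, 1),
)
print(stmt.is_traceable())
```

Keeping the source and curator on every statement, rather than on the database as a whole, is one simple way to let each assertion be audited on its own.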


Ontologies, controlled vocabularies and standard names

Biocurators commonly employ and take part in the creation and development of shared biomedical ontologies: structured, controlled vocabularies that encompass many biological and medical knowledge domains, such as the Open Biomedical Ontologies. These domains include genomics and proteomics, anatomy, animal and plant development, biochemistry, metabolic pathways, taxonomic classification, and mutant phenotypes. Given the variety of existing ontologies, there are guidelines that orient researchers on how to choose a suitable one. The Unified Medical Language System is one system that integrates and distributes millions of terms used in the life sciences domain.

Biocurators enforce the consistent use of gene nomenclature guidelines and participate in the genetic nomenclature committees of various model organisms, often in collaboration with the HUGO Gene Nomenclature Committee (HGNC). They also enforce other nomenclature guidelines like those provided by the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (IUBMB), one example of which is the Enzyme Commission (EC) number. More generally, the use of persistent identifiers is praised by the community, so as to improve clarity and facilitate knowledge


DNA annotation

In genome annotation, the identifiers defined by ontologists and consortia are used to describe parts of the genome. For example, the Gene Ontology (GO) curates terms for biological processes, which are used to describe what we know about specific genes.
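
To make the idea of describing genes with GO terms concrete, here is a minimal sketch that indexes a handful of gene-to-term annotations. The two GO identifiers and their labels are real GO terms, but the gene/term pairs and the helper `genes_by_term` are illustrative assumptions only, not curated data and not the GO Consortium's file format.

```python
from collections import defaultdict

# Tiny excerpt of GO term identifiers and their labels.
GO_TERMS = {
    "GO:0006915": "apoptotic process",
    "GO:0006281": "DNA repair",
}

# (gene symbol, GO term ID) pairs; illustrative only, not curated data.
ANNOTATIONS = [
    ("TP53", "GO:0006915"),
    ("TP53", "GO:0006281"),
    ("BRCA1", "GO:0006281"),
]


def genes_by_term(annotations):
    """Group gene symbols by the GO term they are annotated with."""
    index = defaultdict(set)
    for gene, term_id in annotations:
        index[term_id].add(gene)
    return index


index = genes_by_term(ANNOTATIONS)
for term_id in sorted(index):
    print(term_id, GO_TERMS[term_id], sorted(index[term_id]))
```

Because every annotation points at a shared identifier instead of a free-text phrase, a question such as "which genes are annotated to DNA repair?" becomes a simple lookup.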


Text annotation

As of 2021, life sciences communication is still done primarily in natural languages, like English or German, which carry a degree of ambiguity and make it hard to connect knowledge. So, besides annotating biological sequences, biocurators also annotate texts, linking words to unique identifiers. This aids disambiguation, clarifies the intended meaning, and makes the texts processable by computers. One application of text annotation is to specify the exact gene a scientist is referring to. Publicly available text annotations make it possible for biologists to take further advantage of biomedical text. Europe PMC has an Application Programming Interface which centralizes text annotations from a variety of sources and makes them available in a Graphical User Interface called SciLite. PubTator Central also provides annotations, but is fully based on computerized text-mining and does not provide a user interface. There are also programs that allow users to manually annotate the biomedical texts they are interested in, such as the ezTag system.
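
The core step of text annotation, linking words to unique identifiers, can be sketched with a simple dictionary lookup. This is only a toy under stated assumptions: the lexicon, the `annotate` function, and the choice of NCBI Gene identifiers are illustrative, and it is not how Europe PMC, PubTator Central or ezTag are implemented.

```python
import re

# Hypothetical lexicon mapping surface forms to gene identifiers (NCBI Gene IDs).
LEXICON = {
    "p53": "NCBIGene:7157",
    "TP53": "NCBIGene:7157",
    "insulin": "NCBIGene:3630",
}


def annotate(text, lexicon=LEXICON):
    """Return (start, end, matched_text, identifier) for every lexicon hit."""
    # Longest names first, so a longer surface form wins over a shorter one.
    names = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )
    lookup = {name.lower(): ident for name, ident in lexicon.items()}
    return [
        (m.start(), m.end(), m.group(0), lookup[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


sentence = "Loss of p53 alters insulin signalling."
for hit in annotate(sentence):
    print(hit)
```

Note that "p53" and "TP53" resolve to the same identifier, which is the disambiguation benefit described above; real systems also have to handle the harder case where one word maps to several candidate identifiers.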


International Society for Biocuration (ISB)

The International Society for Biocuration (ISB) is a non-profit organisation that "promotes the field of biocuration and provides a forum for information exchange through meetings and workshops." It grew out of the International Biocuration Conferences and was founded in early 2009. The ISB offers two awards to biocurators in the community: the Biocuration Career Award (given annually) and the ISB Award for Exceptional Contributions to Biocuration (given biannually). The official journal of the ISB, ''Database'', is a venue specialized in articles about databases and biocuration.


Community curation

Traditionally, biocuration has been done by dedicated experts, who integrate data into databases. Community curation has emerged as a promising approach to improve the dissemination of knowledge from published data and provide a cost-effective way to improve the scalability of biocuration. In some cases, community help is leveraged in jamborees that introduce domain experts to curation tasks, carried out during the event, while other efforts rely on asynchronous contributions from experts and non-experts.


Biological databases

Several biological databases include author contributions in their functional curation strategy to some extent. These contributions may range from associating gene identifiers with publications or free text to more structured and detailed annotation of sequences and functional data, producing curation to the same standards as professional biocurators. Most community curation at model organism databases involves annotation by the original authors of published research (first-pass annotation) to effectively obtain accurate identifiers for objects to be curated, or to identify data types for detailed curation. For example:
* WormBase successfully solicits first-pass annotation from users and has integrated author curation with the micropublication process. WormBase also integrates text-mining into its platform, providing suggestions to community curators.
* FlyBase sends email requests to authors of new publications, inviting them to list the genes and data types described via an online tool, and has also mobilized the community to write gene summary paragraphs.

Other databases, such as PomBase, rely on publication authors to submit highly detailed, ontology-based annotations for their publications, and metadata associated with genome-wide data sets using controlled vocabularies. A web-based tool, Canto, was developed to facilitate community submissions. Since Canto is freely available, generic and highly configurable, it has been adopted by other projects. Curation is subject to review by professional curators, resulting in high-quality, in-depth curation of all molecular data types. The widely used UniProt knowledgebase also has a community curation mechanism that allows researchers to add information about proteins.


Wiki-style resources

Bio-wikis rely on their communities to provide content, and a series of wiki-style resources are available for biocuration. ''AuthorReward'', for example, is an extension to MediaWiki that quantifies researchers' contributions to biowikis. RiceWiki was an example of a wiki-based database for community curation of rice genes equipped with ''AuthorReward''. CAZypedia is another such wiki, for community biocuration of information on carbohydrate-active enzymes (CAZymes). WikiProteins/WikiProfessional was a project to semantically organize biological data led by Barend Mons. The 2007 project had direct contributions from Jimmy Wales, Wikipedia co-founder, and took Wikidata as an inspiration. A currently active project that runs on an adaptation of MediaWiki software is WikiPathways, which crowdsources information about biological pathways.


Wikipedia

There is some overlap between the work of biocurators and Wikipedia, with boundaries between scientific databases and Wikipedia becoming increasingly blurred. Databases like Rfam and the Protein Data Bank, for example, make heavy use of Wikipedia and its editors to curate information. However, most databases offer highly structured data that is searchable in complex combinations, which is usually not possible on Wikipedia, although Wikidata aims to solve this problem to some extent. The Gene Wiki project used Wikipedia for collaborative curation of thousands of genes and gene products, such as titin and insulin. Several projects also employ Wikipedia as a platform for curation of medical information. Another way that Wikipedia is used for biocuration is via its list articles. For example, the Comprehensive Antibiotic Resistance Database integrates its assessment of databases about antibiotic resistance into a particular Wikipedia list.


Wikidata

The Wikimedia knowledge base Wikidata is increasingly being used by the biocuration community as an integrative repository across the life sciences. Wikidata is seen by some as an alternative with better prospects of maintenance and interoperability than smaller independent biological knowledge bases. Wikidata has been used to curate information on SARS-CoV-2 and the COVID-19 pandemic, and by the Gene Wiki project to curate information about genes. Data from biocuration on Wikidata is reused by external resources via SPARQL queries. Some projects use curation via Wikidata as a path to improve life-sciences information on Wikipedia.
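
Reuse via SPARQL can be sketched as follows. The example only builds a query and the corresponding request URL for the public Wikidata Query Service; it does not send the request. The helper names `build_query` and `build_request_url` are hypothetical, and the query itself (human proteins that carry a UniProt identifier) is just one assumed use case, relying on the Wikidata properties P352 (UniProt protein ID) and P703 (found in taxon) and the item Q15978631 (Homo sapiens).

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(limit=5):
    """Build a SPARQL query for human proteins that have a UniProt ID."""
    # P352 = UniProt protein ID, P703 = found in taxon, Q15978631 = Homo sapiens
    return f"""
SELECT ?protein ?proteinLabel ?uniprot WHERE {{
  ?protein wdt:P352 ?uniprot ;
           wdt:P703 wd:Q15978631 .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT {int(limit)}
""".strip()


def build_request_url(query):
    """Encode the query as a GET request URL asking for JSON results."""
    return WIKIDATA_SPARQL_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


url = build_request_url(build_query(limit=3))
print(url)
```

Fetching the URL with any HTTP client would return the matching items as JSON; an external resource would typically set a descriptive User-Agent header and cache the results rather than query on every page load.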


Gamified resources

An approach to involving the crowd in biocuration is via gamified platforms that use game design principles to boost engagement. A few examples are:
* Mark2Cure, a gamified platform for community curation of biomedical abstracts.
* Cochrane Crowd, a platform by Cochrane for curating clinical trials and for categorizing and summarizing biomedical literature.
* CIViC, a portal for annotation of genomic variants related to cancer, which tracks scores and keeps leaderboards.
* APICURON, a database to credit and acknowledge the work of biocurators, which collects and aggregates biocuration events from third-party resources and generates achievements and leaderboards.


Computational text mining for curation

Natural-language processing and text mining technologies can help biocurators extract information for manual curation. Text mining can scale curation efforts, supporting, for example, the identification of gene names and the partial inference of ontologies. The conversion of unstructured assertions into structured information makes use of techniques such as named entity recognition and dependency parsing. Text mining of biomedical concepts faces challenges regarding variations in reporting, and the community is working to increase the machine-readability of articles.

During the COVID-19 pandemic, biomedical text mining was heavily used to cope with the large amount of published scientific research about the topic (over 50,000 articles). The popular NLP Python package spaCy has a modification for biomedical texts, SciSpaCy, which is maintained by the Allen Institute for AI. Among the challenges for text mining applied to biocuration is the difficulty of accessing the full texts of biomedical articles behind paywalls, which links the challenges of biocuration to those of the open-access movement.

A complementary approach to biocuration via text mining involves applying optical character recognition to biomedical figures, coupled with automatic annotation algorithms. This has been used to extract gene information from pathway figures, for example. Suggestions for improving written text to facilitate annotation range from using controlled natural languages to clearly associating concepts (such as genes and proteins) with the particular species of interest. While challenges remain, text mining is already an integral part of the biocuration workflow in several biological knowledgebases.
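
To make the idea of named entity recognition concrete, the following is a toy dictionary-based tagger written with only the Python standard library. It is a simplified stand-in and not how spaCy or SciSpaCy work internally, since those rely on trained statistical models. The lexicon entries, labels and function name are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical mini-lexicon mapping surface forms to entity types.
LEXICON = {
    "BRCA1": "GENE",
    "TP53": "GENE",
    "p53": "PROTEIN",
    "breast cancer": "DISEASE",
}


def tag_entities(text, lexicon=LEXICON):
    """Return (start, end, surface, label) tuples for lexicon terms in text.

    Matching is case-sensitive and restricted to whole words.
    """
    # List longer terms first so multi-word names win over shorter overlaps.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(term) for term in terms) + r")\b"
    )
    return [
        (match.start(), match.end(), match.group(0), lexicon[match.group(0)])
        for match in pattern.finditer(text)
    ]
```

Applied to a sentence such as "Mutations in BRCA1 and TP53 are linked to breast cancer.", the tagger returns one tuple per recognized mention, with character offsets that a curator interface could use to highlight candidate annotations. A real pipeline would also need to handle synonyms, case variation and ambiguous symbols, which a fixed dictionary cannot resolve.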


Biocreative challenges

The interface between text mining and biocuration has been propelled by the BioCreAtIvE (Critical Assessment of Information Extraction systems in Biology) challenges, a series of text-mining competitions first held in 2004.


See also

* AgBase
* Biological database
* Digital curation
* International Society for Biocuration
* Model Organism Database
* OBO Foundry


External links


International Society for Biocuration

Biocreative

Online course on biocuration at EMBL-EBI
Biological databases